Abstract

This  research  project  involves  an  investigation  of  parallel  processing  using  reconfigurable  logic 
devices. The goal of this project is to support the Naval Research Laboratory’s recent acquisition of a Cray
XD-1  supercomputer.  A  feature  of  the  Cray  XD-1  is  that  it  contains  field  programmable  gate  arrays 
(FPGAs).  These  reconfigurable  devices  contain  hardware  whose  connections  can  be  modified  to 
target  a  specific  computation.  This  adaptability  can  significantly  improve  the  processing  speed  of 
computationally  intensive  operations. 

Recent improvements in the memory capacity of FPGAs have spurred interest in using the devices
for floating-point arithmetic under the IEEE 754 standard. However, adapting a program designed
to run on a sequential processor to run instead on an FPGA can be time consuming and difficult
for anyone lacking significant experience in hardware design. In this project, a high-level
language (HLL), Mitrion-C 1.4, was used to reduce some of this effort. Using this language, two
calculations taken from a ray-tracing simulation of NASA’s Moderate Resolution Imaging
Spectroradiometer (MODIS) were implemented on an FPGA. The calculations consisted of
floating-point additions, subtractions, multiplications, divisions, and square root extractions.
It was feasible to perform many of the calculations in parallel, leading to a substantial increase
in system throughput. Functionally identical programs were also implemented on a sequential
processor, an Opteron 275, using the American National Standards Institute’s standard for C
(ANSI-C).

Those portions of the FPGA design and of the sequential programs that were dedicated to
performing scientific calculations were isolated, and their processing time was measured using
functions written in ANSI-C and run on the sequential processor. In addition, power consumption
was measured both while the FPGA hardware implementation ran and while the sequential program
ran. The results showed that implementing the two calculations on an FPGA was about 900% faster
than on a sequential processor, while requiring only roughly a 30% increase in power consumed.


Keywords: field programmable gate arrays, floating point arithmetic, high-performance
reconfigurable computing, Mitrion-C, power consumption
Chapter 1
Introduction 


The Naval Research Laboratory (NRL) acquired a Cray XD1 supercomputer in 2005. The system
uses 840 AMD Opteron 275 Dual-Core processors and 144 Xilinx Virtex-II Pro Field-Programmable
Gate Arrays (FPGAs) [1]. The system was purchased by the Center for Computational Science,
located in the Information Technology Division, which seeks to provide high-performance
computing (HPC) resources for Department of Defense (DoD) research [2]. High-performance
computing describes the use of processors or computing nodes connected in parallel to perform
supercomputing. The performance of an HPC system is typically measured in FLOPS, which
refers to Floating-point Operations Per Second.

In  traditional  HPC  a  given  application  is  split  between  large  numbers  of  commercially-available 
sequential  processors  running  in  parallel.  This  approach  permits  high  throughput  at  relatively 
low  expense.  However,  conventional  processors  are  typically  designed  to  be  used  for  sequential 
operations.  When  heavily  parallel  problems  are  implemented  on  them,  these  processors  often 
cannot  reach  full  utilization.  Although  connecting  many  nodes  in  parallel  has  allowed  HPC  to 
be  done  using  conventional  processors,  portions  of  each  processor  needed  for  other  applications 
might  sit  idle  during  a  parallel  task,  thus  resulting  in  inefficiencies  in  both  cost  and  performance. 

Field-Programmable  Gate  Arrays  (FPGAs)  are  semiconductor  devices  containing  many  “logic 
blocks”  that  can  be  reconfigured  to  perform  basic  logic  functions,  such  as  AND,  OR,  and  NOT. 
These  basic  logic  functions  are  the  foundations  of  all  computing  tasks  and  any  other  application, 
including  arithmetic,  can  be  created  using  only  these  functions.  Because  they  are  reprogrammable, 
FPGAs  have  in  the  past  been  used  to  test  circuit  designs  before  mass  production.  However,  recent 
advances in FPGA technology have made it feasible to perform floating-point operations on
FPGAs. 

NRL purchased the Cray XD1 to test the applicability of using FPGAs to accelerate Navy
and  Department  of  Defense  applications.  Despite  previous  research  that  shows  that  floating-point 
operations  are  not  only  possible  on  FPGAs  but  should  be  accelerated  by  their  use,  the  fact  remains 
that  customizing  an  FPGA  for  a  specific  application  is  generally  regarded  as  a  time-consuming  and 
technically  difficult  process.  Several  techniques  have  been  applied  to  simplifying  the  process  of 
programming  an  FPGA,  to  varying  levels  of  success  [3].  This  project  gathered  data  on  cost  versus 
benefit  when  implementing  a  particular  application  on  FPGAs. 
In this project the IEEE 754 standard [4] was used for floating-point representation. Although
past research [5], [6], [7] has found that representing floating-point numbers using custom formats
on FPGAs typically requires fewer FPGA resources and can run faster, the scientific
community has largely adopted the IEEE 754 standard.

A specific application was selected to investigate floating-point operation performance. The
application was an optical simulation of NASA’s Moderate Resolution Imaging Spectroradiometer
(MODIS). The MODIS simulation was chosen for three reasons. First, previous research has
shown that the problem is highly amenable to parallel processing [8]. Second, the problem can use
the IEEE 754 standard. Finally, the problem requires implementation of floating-point addition,
subtraction, multiplication, division, and square root. Since these mathematical functions are the
most used in the vast majority of possible scientific functions, a study into their performance on
FPGAs is applicable to FPGA floating-point operations in general.

A cost-versus-benefit analysis comparing current FPGA technology to available conventional
processors was sought in this project. Modern HPC systems are expensive to operate because of
power and heat requirements. Initial experiences with high-performance computing using FPGAs
have suggested that total power consumption can be reduced because FPGAs operate at lower
clock speeds and so draw less power per chip. The only way FPGAs can show improvements in
throughput over processors with higher clock speeds, therefore, is if they are customized so that
their resources are more heavily used.

Although the specifications of current technology indicate that employing FPGAs over
conventional processors would be more cost-effective, these assertions have yet to be quantitatively
confirmed with a realistic application. The FPGAs available on the Cray XD1 have access to
16 megabytes (MB) of memory for input and output [1]. Many applications require more than
16 MB, so in those cases, a conventional processor with access to larger banks of memory
repeatedly transfers data between the FPGA and external memory. The effects on power requirements
and throughput of this use of conventional processors in conjunction with FPGAs are not well
documented. In this project, a practical scientific application was implemented on commercially
available reconfigurable devices in order to generate a cost-benefit comparison for an application
using both FPGAs and conventional processors.
Chapter 2
Background 


2.1  Field-Programmable  Gate  Arrays 

A useful primer to understanding FPGA architecture is Brown and Rose’s Architecture of FPGAs
and CPLDs: A Tutorial [9]. Field-Programmable Gate Arrays are a type of Field-Programmable
Device (FPD). FPDs are sometimes also known as Programmable Logic Devices (PLDs). These
terms generally describe devices that can be configured by the user to implement hardware designs.
Using FPDs allows designers to test their designs without having to incur the high fixed costs
associated with custom-designed integrated circuits.

Traditional semiconductor devices implement hardware designs by creating devices and the
electrical connections between them on a single integrated circuit. The precursor to the FPGA was
the Mask-Programmable Gate Array (MPGA). These devices consisted of transistors, the basic
building blocks of almost all electrical circuitry, in an array that could be connected physically
at the time of manufacture to realize a circuit design. However, this technique required that a
customized chip be fabricated, an expensive process because of high associated fixed costs. The
FPGA applies the concept of an MPGA but is implemented using user-programmable technology.

Figure  2.1  on  the  following  page  shows  the  architecture  of  an  FPGA.  Each  of  the  input/output 
(I/O)  blocks  can  be  configured  for  input,  output,  or  bidirectional  behavior.  The  logic  blocks  can 
be  configured  to  behave  as  any  combination  of  logic  gates.  The  physical  connections  between 
logic  blocks  can  be  switched  on  or  off  by  the  user  to  connect  logic  blocks  without  the  need  for 
physical  fabrication  [10].  The  figure  gives  an  idea  of  how  any  I/O  or  logic  block  can  potentially  be 
connected  to  any  other  block  by  reconfiguring  the  FPGA’s  network  of  connections. 

Development of FPDs has recently been focused mostly on FPGAs because they employ Dynamic
Random Access Memory (DRAM), and so have a higher logic capacity than other FPDs.
Logic  capacity  refers  to  the  amount  of  logic  that  can  be  mapped  to  a  given  FPD.  It  is  usually 
compared  to  the  equivalent  number  of  logic  gates  that  would  be  available  on  a  traditional  gate 
array  [9]. 


2.2  Software  Development  vs.  Hardware  Design 

In  this  report,  the  terms  software  development  and  hardware  design  are  used  to  distinguish  between 
two  different  production  methodologies.  Whereas  software  engineers  are  most  concerned  about 
the  correctness  of  their  algorithms  and  can  ignore  many  hardware  constraints,  hardware  designers 
cannot  afford  to  do  so:  from  the  outset  they  must  consider  the  actual  physical  limitations  of  the 
device they are working with. This is also why what could be considered a “program” for an
FPGA is known instead as a design: hardware code is literally mapped to the physical
components of a device, and so has more in common with circuit design than software programs.

Working  with  both  hardware  and  software,  as  is  the  case  in  this  project,  can  present  challenges 
because  of  overlapping  terminology  and  cultural  differences  between  the  two  fields.  The  main 
difference  lies  in  the  level  of  abstraction.  Software  is  created  at  a  high  level  of  abstraction.  This 
means  that  software  developers  work  with  logic,  but  the  implementation  of  that  logic  is  taken  care 
of  by  the  target  device  (a  processor).  The  benefit  of  high-level  programming  is  that  software  can 
usually  be  run  on  a  wide  variety  of  devices  without  device-specific  customization. 

Hardware design functions in the “real world” of physical things. Hardware designers must
be aware of their target device’s physical constraints, such as available memory, logic blocks, and
physical connections between elements. Working with such a low level of abstraction, that is, at
such a high level of detail, has traditionally meant that hardware design takes significantly longer
than software development [11].

Figure 2.2: Hardware design flow.

Efforts  have  been  made  to  increase  the  abstraction  level  available  for  hardware  design  so  as 
to  allow  researchers  with  less  experience  in  the  field  to  customize  hardware  for  their  projects. 
However,  attempts  to  bridge  software  and  hardware  have  faced  difficulty  in  overcoming  a  basic 
difference  in  design  methodology.  Software  developers  design  sequential  code.  That  is,  they  act 
as  if  each  line  of  code  is  run  after  every  line  preceding  it.  Hardware  developers,  on  the  other  hand, 
must  always  think  in  parallel.  All  blocks  within  a  hardware  design  are  synchronized  by  a  clock,  but 
can  process  inputs  and  outputs  independently  of  all  other  blocks.  Figure  2.2  shows  the  hardware 
design  process  and  is  discussed  below. 

Hardware  design  can  begin  at  a  relatively  high  level  of  abstraction  if  a  high-level  language 
(HLL)  is  used.  The  development  of  these  tools  is  further  discussed  in  section  3.1  on  page  18. 
After  the  design  stage,  a  simulator  is  used  to  test  that  the  logic  of  a  hardware  design  is  functionally 
correct.  While  this  step  would  essentially  be  the  last  step  in  software  development,  a  hardware 
design  must  go  through  several  more  processes  before  it  can  be  loaded  onto  a  device.  Compilation 
of  a  hardware  design  requires  two  steps.  First,  the  synthesis  step  generates  a  device-independent 
intermediate representation of the design. Synthesis tells the designer how many resources a
design will require; if the target hardware lacks sufficient resources, the designer must return to the
design  stage.  Place-and-route  is  the  second  compilation  step  and  is  only  run  if  synthesis  completes 
successfully. 

In the place-and-route step, the structures generated in synthesis are mapped to physical
components on the FPGA. This process is device dependent and will fail if the required resources are
not available on the target device. Place-and-route is the most time-consuming step of hardware
design, taking from thirty minutes to many hours depending on the complexity of the design and
the hardware specified. The output of this step is called a bitstream, a mapping of binary values
to specific blocks and connections on a device. If place-and-route or synthesis fails, the designer
must return to the design step to adjust the design, as shown in Figure 2.2.
The final step of the process is the download of the bitstream onto hardware. The specifics of this
step depend on the particular hardware being used [10].

Figure 2.3: IEEE 754 Single-precision floating-point representation.


2.3  IEEE  754  Single-Precision  Floating-Point  Representation 

Floating-point representation is a system used for representing real numbers. The name derives
from the fact that the location of the decimal point or radix point of the number being represented
is variable. A floating-point number is often known as a float. The floating-point format differs
from the fixed-point number representation system, in which the location of the decimal point is
constant. For example, integer representations of numbers are a fixed-point representation because
all numbers have exactly zero decimal places. The benefit of using floating-point representation is
its ability to represent a greater range of values, which is often important to scientific applications.
The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the industry standard for
the representation of floating-point values today [4]. It allows numbers to be represented as binary
strings of 1’s and 0’s, which is important for computing applications. There are four
floating-point representation formats defined by IEEE 754, of which only two are commonly used:
single-precision and double-precision. The IEEE 754 standard only requires that 32 bits be used in
the single-precision representation; other bits are optional. Single-precision numbers are adequate
for many scientific applications, and the double-precision standard takes significantly more
resources to implement on FPGAs. Therefore, only the single-precision floating-point format was
considered in this project.

Table 2.1: Truth table comparing XOR and binary addition.

X    Y    X XOR Y    X + Y
0    0    0          0
0    1    1          1
1    0    1          1
1    1    0          (1)0

As mentioned before, the IEEE 754 single-precision format represents a real number using a
string of 1’s and 0’s. Figure 2.3 shows how the 32 bits of a single-precision number are broken
down. In the figure, a given floating-point number $f$ can be represented as the equation

$$f = (-1)^{s} \times 2^{e-127} \times m \qquad (2.1a)$$

In this equation, $s = 1$ when bit 31 = 1 and $s = 0$ when bit 31 = 0. The exponent field $e$ is
adjusted by 127 to allow the exponent to range between $-126$ and $127$. The values of $e$ that would
represent exponents of $-127$ and $128$ are instead used for special cases, as described later in this
section. The significand, $m$, always has a leading bit of 1 for normalized numbers, so only the
fractional part is stored in the floating-point format. By definition, $1 \le m < 2$. Consider
the example of the floating-point representation of the decimal value 5.0. The number would
be represented as the binary string $0\ 10000001\ 01000000000000000000000_2$ using the IEEE 754
format. The description of the IEEE format, above, explains how the 32-bit binary string represents
a decimal number. The sign bit $s$ is 0, so the number is positive. The exponent field $e$ is
$10000001_2$, which converts into $1 \times 2^{7} + 1 \times 2^{0} = 128 + 1 = 129$. This means that the
actual exponent is $129 - 127 = 2$. Finally, the significand is $(1.)01000000000000000000000_2$, or
$1 + 1 \times 2^{-2} = 1.25$. After calculating these values, $f$ can be solved:
$f = 1 \times 2^{2} \times 1.25 = 5.0$.
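The worked example for 5.0 can be checked with a short script. This is only an illustrative sketch, not part of the project's FPGA or ANSI-C code; the function name `decode_float32` is hypothetical, and it assumes a normalized input so that the implicit leading 1 applies.

```python
import struct

def decode_float32(value):
    """Split a number into the IEEE 754 single-precision fields s, e, and m.

    Assumes a normalized number, so the significand has an implicit leading 1.
    """
    # Pack as a 32-bit float, then reinterpret the same 4 bytes as an integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    s = bits >> 31                # sign: bit 31
    e = (bits >> 23) & 0xFF       # exponent field: bits 30..23
    frac = bits & 0x7FFFFF        # stored fractional part: bits 22..0
    m = 1 + frac / 2**23          # restore the implicit leading 1
    return bits, s, e, m

bits, s, e, m = decode_float32(5.0)
print(f"{bits:032b}")                  # 01000000101000000000000000000000
print(s, e, m)                         # 0 129 1.25
print((-1)**s * 2.0**(e - 127) * m)    # 5.0
```

The three printed lines correspond to the binary string, the extracted fields, and the reconstructed value from the example.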

In addition to representing real numbers, the IEEE 754 standard also provides codes for
infinity and not-a-number, abbreviated NaN. These codes result from underflow or overflow, which
occur when $f$ exceeds the range of the IEEE 754 standard, or from division by 0. In addition,
denormalized numbers are numbers where both the exponent $e$ and the leading bit of the
significand are 0. This format allows representation of the numbers in the range
$\pm 0.\mathit{frac} \times 2^{-126}$, where frac is the fractional part of the significand $m$. This
representation only uses a portion of the precision of the significand. These special cases of the
IEEE 754 standard are necessary for many applications. They are also significant to the
implementation of floating-point operations on hardware because any system that uses the
IEEE 754 standard must devote logic to dealing with these special cases [4].
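The special cases can be told apart from the exponent and fraction fields alone, which is the kind of extra decision logic a hardware implementation has to carry. The sketch below is illustrative only; `classify_float32` and `denormal_value` are hypothetical names, and the bit patterns in the usage lines are standard single-precision encodings.

```python
def classify_float32(bits):
    """Classify a 32-bit IEEE 754 single-precision pattern by its fields."""
    e = (bits >> 23) & 0xFF      # exponent field
    frac = bits & 0x7FFFFF       # fractional part of the significand
    if e == 0xFF:                # field value that would mean exponent 128
        return "infinity" if frac == 0 else "NaN"
    if e == 0:                   # field value that would mean exponent -127
        return "zero" if frac == 0 else "denormalized"
    return "normalized"

def denormal_value(bits):
    """Value of a denormalized pattern: (-1)^s * 0.frac * 2^-126."""
    s = bits >> 31
    frac = bits & 0x7FFFFF
    return (-1)**s * (frac / 2**23) * 2.0**-126

print(classify_float32(0x7F800000))   # infinity
print(classify_float32(0x7FC00000))   # NaN
print(classify_float32(0x00000001))   # denormalized
print(classify_float32(0x40A00000))   # normalized (this is 5.0)
```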

2.4  Mathematical  Operations  with  Floating-point  Numbers 

Floating-point arithmetic is considerably more taxing on computer systems than fixed-point
arithmetic. The additional cost in resources is not because of increased memory requirements: it is
much more difficult to multiply two 32-bit IEEE 754 floating-point numbers than it is to multiply
two 32-bit fixed-point numbers. The difficulty of performing floating-point arithmetic comes from
the complexity of its algorithms. This section contrasts the logic behind fixed-point and
floating-point multiplication as an example of this complexity.

Multiplying two fixed-point binary integers with digital logic is relatively simple.
Multiplication is repeated addition. For binary integers, addition follows almost the same rules as the
exclusive OR (XOR) logic gate, as shown in Table 2.1.

XOR returns 1 whenever exactly one of its inputs is 1. Line 4 of the table shows that when both
X and Y are 1, the sum is 0 with a carry bit of 1, depicted as (1). Resolution of the carry bit requires
additional logic. Even so, the gates required overall for binary addition are common in digital logic
and require few resources to implement. Fixed-point multiplication can be implemented by using
this algorithm repeatedly. Multiplication of floating-point numbers, on the other hand, requires
several stages and different types of logic. Figure 2.4 shows the required logic flow and is discussed
in the paragraphs below.

Figure 2.4: Floating-point multiplication algorithm.

In the figure, the subscript s refers to the sign of a number, e to the exponent, and m to the
significand. The two inputs are X and Y and the output is R. The algorithm first checks whether
either X or Y is zero, in which case R is set to zero and the multiplication ends. Otherwise, the
resultant sign of R is computed by comparing the sign bits of X and Y using an XOR gate. Next, the
fractional parts of the significands of the inputs are multiplied, using a fixed-point multiplication
method. The result ($R_m$) must be rounded to fit within the single-precision standard. To calculate
the exponent of the result ($R_e$) the exponents of the inputs are added. Also, if $R_m$ is not between
1 and 2, as the IEEE format requires, it must be normalized. This requires a shift operation to
be performed on $R_m$ and for $R_e$ to be incremented or decremented, as appropriate. Finally, the
result is checked for overflow and underflow, which would change the result to either infinity or
NaN. The example given demonstrates the potential complexities hardware designers face when
implementing floating-point arithmetic on reconfigurable logic.

Figure 2.5: Two scheduling techniques applied to a single task.
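The contrast between the two kinds of multiplication can be sketched in software. The following is a simplified illustration of the flow described in this section, not the Mitrion-C design or any vendor core. All function names are hypothetical, inputs are assumed to be zero or normalized (no denormalized, infinite, or NaN operands), rounding is to nearest-even, and underflow is flushed to zero as a simplifying assumption.

```python
import struct

def float_to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def add_bits(x, y):
    """Fixed-point addition from XOR (sum bits) and AND (carry bits)."""
    while y:
        carry = (x & y) << 1    # both bits 1: carry into the next position
        x = x ^ y               # XOR gives the sum without the carry
        y = carry               # repeat until no carries remain
    return x

def mul_fixed(x, y):
    """Fixed-point multiplication as repeated (shifted) addition."""
    result = 0
    while y:
        if y & 1:
            result = add_bits(result, x)
        x <<= 1
        y >>= 1
    return result

def mul_float32(xb, yb):
    """Multiply two single-precision bit patterns, following the staged flow."""
    xs, xe, xm = xb >> 31, (xb >> 23) & 0xFF, xb & 0x7FFFFF
    ys, ye, ym = yb >> 31, (yb >> 23) & 0xFF, yb & 0x7FFFFF
    rs = xs ^ ys                                  # sign of R via XOR
    if (xe == 0 and xm == 0) or (ye == 0 and ym == 0):
        return rs << 31                           # either input zero: R = 0
    xm |= 1 << 23                                 # restore implicit leading 1
    ym |= 1 << 23
    prod = mul_fixed(xm, ym)                      # 48-bit significand product
    re = xe + ye - 127                            # add exponents, remove one bias
    shift = 23
    if prod >> 47:                                # product in [2, 4): normalize
        shift = 24
        re += 1
    rm = prod >> shift
    rem = prod & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and rm & 1):    # round to nearest, ties to even
        rm += 1
        if rm >> 24:                              # rounding carried out to 2.0
            rm >>= 1
            re += 1
    if re >= 255:
        return (rs << 31) | (0xFF << 23)          # overflow: infinity
    if re <= 0:
        return rs << 31                           # underflow: flushed to zero here
    return (rs << 31) | (re << 23) | (rm & 0x7FFFFF)

r = mul_float32(float_to_bits(1.5), float_to_bits(2.5))
print(bits_to_float(r))   # 3.75
```

Even in this reduced form, the floating-point path needs a sign stage, an exponent adder, a wide fixed-point multiplier, a normalizing shifter, a rounder, and range checks, whereas the fixed-point path needs only the adder loop.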


2.5  Scheduling 

Many techniques for implementing floating-point operations on FPGAs exist. Customizations to
data storage and allocation of resources have been made to increase floating-point performance. In
general, one customization important to any supercomputing application is scheduling.
Scheduling refers to the order, or priority, given to tasks in a multi-task system. There are many different
scheduling techniques, each offering unique benefits, and scheduling of resources in an FPGA is
especially important for fast performance. This section summarizes some basic scheduling
techniques relevant to this project.

Within the broad topic of scheduling, pipelining describes the processing of multiple stages
of the same operation simultaneously. As this is a relatively new field, much of the related
terminology is non-standard. The terminology used in this section is adopted from Hsu and Jeang’s
discussion of pipelining techniques [12] with some modifications made to reflect current common
usage. Figure 2.5 shows an example using the quadratic equation. It is an example of what will be
referred to as a task. Anything above the task level might be referred to as a system, application,
or problem.
In Figure 2.5, a task that computes the quadratic equation is split into
subtasks. In this simplified example, each subtask, such as the calculation a × c, requires only
one computational cycle to complete. Typically, subtasks such as mathematical operations using
floating-point numbers take multiple cycles to complete. The two most widely used scheduling
techniques are As-Soon-As-Possible (ASAP) scheduling and As-Late-As-Possible (ALAP)
scheduling. As Figure 2.5 illustrates, in ASAP scheduling, subtasks are
scheduled as soon as all the subtasks they are dependent upon have been scheduled. In contrast,
in ALAP scheduling, the schedule is created from last subtask to first. A subtask is scheduled as
soon as all subtasks dependent upon it have been scheduled [12]. ALAP and ASAP scheduling
are examples of preprocessing scheduling techniques because they only ensure that no conflicts in
dependencies will arise: they do not address the optimization of resource allocation or
throughput. For example, the ALAP schedule of Figure 2.5 requires at least two
multipliers be implemented because two multiplications are scheduled for simultaneous execution
during computational cycle 2. In contrast, the ASAP schedule requires at least three multipliers be
implemented because three multiplications are scheduled during computational cycle 1. However,
both schedules require six total computational cycles to output an answer. Therefore, the ASAP
schedule is less efficient in this particular example because it requires more resources to produce
the same rate of throughput.
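The two techniques can be sketched as follows. Because the figure is not reproduced here, the dependency graph below is an assumption: it computes one root, x = (-b + sqrt(b×b - 4×a×c)) / (2×a), and was chosen so that it reproduces the counts stated in the text (four multiplications, six cycles, three simultaneous multiplications under ASAP, two under ALAP). The names `DEPS`, `asap`, `alap`, and `peak_units` are hypothetical, and each subtask is taken to last one cycle.

```python
# Assumed subtask graph: name -> (functional unit, subtasks it depends on).
DEPS = {
    "ac":      ("mul",  []),                 # a * c
    "bb":      ("mul",  []),                 # b * b
    "two_a":   ("mul",  []),                 # 2 * a
    "four_ac": ("mul",  ["ac"]),             # 4 * (a * c)
    "disc":    ("sub",  ["bb", "four_ac"]),  # b*b - 4ac
    "root":    ("sqrt", ["disc"]),
    "num":     ("sub",  ["root"]),           # sqrt(...) - b
    "x":       ("div",  ["num", "two_a"]),
}

def asap(deps):
    """Schedule each subtask one cycle after its latest predecessor."""
    cycle = {}
    def visit(t):
        if t not in cycle:
            cycle[t] = 1 + max((visit(p) for p in deps[t][1]), default=0)
        return cycle[t]
    for t in deps:
        visit(t)
    return cycle

def alap(deps, length):
    """Work from the last subtask back: one cycle before the earliest successor."""
    succs = {t: [] for t in deps}
    for t, (_, preds) in deps.items():
        for p in preds:
            succs[p].append(t)
    cycle = {}
    def visit(t):
        if t not in cycle:
            cycle[t] = min((visit(s) for s in succs[t]), default=length + 1) - 1
        return cycle[t]
    for t in deps:
        visit(t)
    return cycle

def peak_units(deps, schedule, unit):
    """Largest number of subtasks of one unit type sharing a cycle."""
    per_cycle = {}
    for t, c in schedule.items():
        if deps[t][0] == unit:
            per_cycle[c] = per_cycle.get(c, 0) + 1
    return max(per_cycle.values())

early = asap(DEPS)
length = max(early.values())
late = alap(DEPS, length)
print(length)                           # 6
print(peak_units(DEPS, early, "mul"))   # 3
print(peak_units(DEPS, late, "mul"))    # 2
```

Both schedules respect every dependency and finish in the same number of cycles; they differ only in how many multipliers must exist at once, which neither technique tries to minimize.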

The modulo scheduling technique described by Rau and Glaser [13] measures the effectiveness
of a schedule by considering its throughput and the resources required to implement it. Modulo
scheduling is distinct from both ASAP and ALAP scheduling. It allows the calculation of the
minimum initiation interval, i, that needs to separate the initiation of consecutive tasks. The minimum
initiation interval can be calculated as i = m × T, where m is the modulus and T is the clock period.
In the example presented in Figure 2.6, the computational cycles are without
units, so T simplifies to 1. Ordinarily the modulus of a task is equal to the highest number of
operations any one functional unit must perform. In the quadratic example presented earlier, the
modulus is 4 because four multiplications are required to generate a solution. No other operation
is required more times than this in a single task.

A modulus of 4 means that the minimum initiation interval, i, of the task also equals 4, so one
task’s solution would be generated every four computational cycles. However, the modulus of a
task can be decreased by increasing the number of available functional units. If, for example, two
multipliers were implemented in the given example, each multiplier would only have to complete
two operations, so the modulus would decrease to 2. Figure 2.6 shows how an
m = 2 schedule never requires the use of more than two multipliers simultaneously. This would
also allow the minimum initiation interval to decrease to two computational cycles. If, however, it was
necessary to generate one solution every cycle, more multipliers would have to be implemented. In
the m = 1 schedule four multipliers are required starting at cycle 5. Implementing four multipliers
corresponds with a modulus of 1. Similarly, two subtraction units would be needed at this point.
The modulus could not descend below 2 unless a second such unit was implemented.
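The relationship between functional units, modulus, and initiation interval can be written as a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, the four multiplications and two subtractions per task come from the text, and the single square root and single division are assumed from the quadratic formula.

```python
from math import ceil

def modulus(op_counts, unit_counts):
    """Highest number of operations any one functional unit must perform.

    op_counts:   operations of each type needed by one task
    unit_counts: functional units implemented per type (default: one each)
    """
    return max(ceil(n / unit_counts.get(op, 1)) for op, n in op_counts.items())

def initiation_interval(op_counts, unit_counts, clock_period=1):
    """Minimum initiation interval i = m * T."""
    return modulus(op_counts, unit_counts) * clock_period

# Operation counts for one quadratic task (sqrt and div counts are assumed).
quadratic = {"mul": 4, "sub": 2, "sqrt": 1, "div": 1}

print(initiation_interval(quadratic, {}))                      # 4
print(initiation_interval(quadratic, {"mul": 2}))              # 2
print(initiation_interval(quadratic, {"mul": 4}))              # 2
print(initiation_interval(quadratic, {"mul": 4, "sub": 2}))    # 1
```

The third line shows the point made above: adding multipliers alone stops helping once the single subtraction unit becomes the busiest resource.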

The two schedules of Figure 2.6 show the considerations that must be
taken into account when creating a pipelined schedule of a design. The m = 1 schedule would
allow the generation of one result every computational cycle. This would be a two-fold throughput
increase over the m = 2 schedule, which can only generate one result every two cycles. However,
implementing the m = 1 schedule would require three additional multiplier units and one additional
subtraction unit. This tradeoff between throughput and resource consumption is always on the
mind of a hardware designer, since floating-point operations can easily require more resources
than available, even in modern FPGAs.

Figure 2.6: Modulo scheduling example
Chapter 3
Related  Work 

3.1  Implementation  of  Floating-Point  Operations  on  FPGAs 

In 1994, early attempts to implement floating-point operations on FPGAs focused on implementing
a design capable of adding and multiplying two IEEE 754 single-precision floats. However, it was
discovered that the best design based on a comparison of space versus throughput required more
space than was available on a single device. As a result, the design was implemented across four
FPGAs [14]. The research’s conclusion was that FPGA technology needed to be improved before
floating-point could become feasible.

In 1996, implementations of an adder and a multiplier were made to fit onto a single device
each. However, the implementations suffered poor performance and accuracy compared to
conventional implementations because of resource constraints [15]. Two years later, an
IEEE 754-compliant floating-point adder and multiplier were separately implemented on a single FPGA
each. It was speculated at this point that floating-point operations on FPGAs could potentially
outperform conventional microprocessors in specific circumstances.

The space required to fully implement IEEE 754 floating-point units continued to represent a bottleneck to development, despite the fact that hardware engineers operate at very low levels of abstraction with FPGAs. As a result, several attempts were made to represent floating-point numbers and implement arithmetic operations without using the IEEE 754 standard. Some techniques proved more effective than others. Floating-point to fixed-point conversion involves multiplying a floating-point value by a large number and treating the result as an integer, performing integer arithmetic, and then converting the resulting integer back into a float. This method was shown to be slower and more resource-consuming than a comparable floating-point implementation [16]. By contrast, bit-width optimization, which uses only as many bits to represent a float as the required accuracy demands, showed improvement over the IEEE 754 implementation [17].
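
The float-to-fixed-point round trip described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation evaluated in [16]; the scale factor of 2^16 and the helper names are assumptions made for the example.

```python
# Assumed scale factor: 16 fractional bits (hypothetical choice).
SCALE = 1 << 16


def to_fixed(x: float) -> int:
    """Multiply by a large number and treat the result as an integer."""
    return round(x * SCALE)


def to_float(n: int) -> float:
    """Convert a fixed-point integer back into a float."""
    return n / SCALE


def fixed_add(a: int, b: int) -> int:
    """Addition needs no rescaling: both operands share the same scale."""
    return a + b


def fixed_mul(a: int, b: int) -> int:
    """The raw product carries SCALE twice, so one factor is divided out."""
    return (a * b) // SCALE
```

Values that are not multiples of 1/SCALE, such as 0.1, are rounded on conversion, which is one reason such schemes cannot match the precision and range of IEEE 754.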

However, techniques that avoided using IEEE 754 were unable to match the standard's precision, range, and treatment of non-real or out-of-bounds situations. Therefore, the industry has for the most part adopted IEEE 754 as standard. The initial problems with implementing IEEE 754 seemed to result from technological rather than methodological limitations. Underwood made predictions of FPGA IEEE 754 floating-point performance, concluding that FPGAs would show an order-of-magnitude performance advantage over comparable conventional processors by 2009 [18].

These predictions have led the industry to pursue technological development in high-performance reconfigurable computing (HPRC), a term used to describe traditional HPC using FPGAs. At the same time, techniques for automating the FPGA customization process began to be explored. The attempts to extract peak performance from FPGAs described in this section all required significant knowledge of FPGA architecture and hardware design. Up to this point, designers have used hardware description languages (HDLs) such as Verilog and Very High-Speed Integrated Circuit HDL (VHDL) to describe their algorithms. This code is then compiled by an electronic design automation (EDA) tool into a physical design targeted to a specific device. However, the amount of logic that can be placed on a single chip has grown to the point that very complex algorithms can be implemented on a single FPGA, and so a need for a higher level of abstraction has developed [19].

Developing algorithms at a higher level of abstraction for hardware design also allows software engineers to use FPGAs to accelerate their applications without having to learn an entirely new design methodology. However, the field of HPRC is still in the developmental stages, and high-level languages in particular have a long way to go. A wide range of commercial and open-source HLLs has been developed. Some are easier to use than others, but none has become standard across the industry [3].


3.2  Implementation  of  the  MODIS  System 

This project applied FPGA floating-point acceleration to a real-world scientific application. This technique was chosen for two reasons. First, it was the intent of this project to be useful to other researchers interested in using FPGAs for software acceleration, but having little background in HPRC, rather than to hardware designers already familiar with the field. Second, a substantial amount of research into HPC and HPRC implementations of the chosen application has already been done, and a cache of data is available for comparison purposes.

The  application  was  an  optical  ray-tracing  simulation  of  NASA’s  Moderate  Resolution  Imaging 
Spectroradiometer  (MODIS).  In  the  spectroradiometer,  light  from  the  sun  passes  through  multiple 
optical  elements  before  reaching  a  detector.  To  simulate  this  system,  each  interaction  of  a  ray 
of  light  with  an  optical  element  must  be  simulated.  This  interaction  entails  several  important 
calculations,  of  which  two  were  of  primary  focus  in  this  project:  (1)  finding  the  point  at  which  a 
ray  intersects  an  optical  surface  and  (2)  finding  the  direction  of  the  ray’s  travel  after  interacting 
with  that  surface  [20].  These  steps  can  be  simplified  into  a  system  of  floating-point  equations  with 
a  constant  number  of  inputs  and  outputs. 

The MODIS simulation was initially implemented using Fortran on a Digital Equipment Corporation (since bought by Hewlett-Packard) Alpha 3000 series model 800 computer. The slow performance of this first program prompted interest in methods to speed up processing. Cameron et al. began work in 2002 on implementing the MODIS simulation on multiple digital-signal-processing chips. A system functionally comparable to the original Fortran model was written in the C programming language. The complete simulation was implemented and tested successfully. Initial estimates were that using digital signal processors (DSPs) would show a linear relationship between the number of DSPs used and speedup [21]. This result was confirmed using eight DSPs simultaneously running a complete MODIS simulation [22].

Figure 3.1: Amdahl's law versus measured performance.

Using the Message Passing Interface (MPI), the MODIS simulation was subsequently implemented on the Naval Research Laboratory's massively parallel Cray XD-1. Turn-around-time data were collected for a single processor and for multiple processors running in parallel [8]. Data points were measured for varying numbers of processors in the range from 1 to 839. These data are reproduced in Figure 3.1 as the line labeled "MODIS simulation".

The performance of parallel computing applications is often measured in speedup, which is the ratio of improved throughput to original throughput, or $s = p / p_{reference}$. Figure 3.1 graphs Amdahl's Law, which shows that speedup s can be increased by implementing a parallel application using additional processors. In this specific case, $s = p_n / p_1$, where $p_1$ refers to the throughput achieved using one processor and $p_n$ refers to the throughput achieved using n processors. It also shows that speedup is only maximized when $r_s = 0$, where $r_s$ is the portion of a calculation that must be performed sequentially [23]. The speedup s can be measured using throughput, but s can also be calculated using the formula:

$$ s = \frac{1}{r_s + \dfrac{r_p}{n}} \tag{3.1a} $$

where s is speedup, $r_s$ is the serial component of the process or calculation being implemented, $r_p$ is the parallel component, and n is the number of processors being used for processing. It is also important to note that by definition, $r_s = 1 - r_p$.
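
As a quick numerical illustration, the sketch below evaluates Amdahl's Law in its usual form, s = 1 / (r_s + r_p / n) with r_p = 1 - r_s, which is assumed here to be the content of equation (3.1a). The function name is hypothetical.

```python
def amdahl_speedup(r_s: float, n: int) -> float:
    """Speedup predicted by Amdahl's Law for a serial fraction r_s
    running on n processors (r_p = 1 - r_s is the parallel fraction)."""
    if not 0.0 <= r_s <= 1.0:
        raise ValueError("r_s must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    r_p = 1.0 - r_s
    return 1.0 / (r_s + r_p / n)
```

With r_s = 0 the speedup equals n, which is the linear relationship mentioned above for the DSP implementation; with any r_s > 0 the speedup can never exceed 1 / r_s, however many processors are added.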
This calculation of s allows a designer to predict whether it would be worthwhile to implement a given application using hardware that exploits parallelism. Using multiple processors, as discussed in section 1 on page 7, is one way to implement HPC. To gauge cost versus benefit, the speedup gained would be compared to the extra time, money, and labor required to implement a given application on multiple processors.

The throughput measured when implementing the MODIS system on multiple conventional processors is plotted against examples of Amdahl's Law at different $r_s$ values in Figure 3.1. Amdahl's Law was used to estimate the components $r_s$ and $r_p$ for the MODIS simulation system. The graph shows that the MODIS simulation had a very low $r_s$ and so an $r_p$ value very close to 1. Therefore, it was concluded that the MODIS simulation would be a good candidate for acceleration using FPGAs, since FPGAs are best suited to highly parallel, data-intensive applications.
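
One way to estimate the serial component from a measured data point is to solve Amdahl's Law, s = 1 / (r_s + (1 - r_s) / n), for r_s. The sketch below does this; it is an illustration of the algebra rather than the fitting procedure actually used for Figure 3.1, and the function name is hypothetical.

```python
def serial_fraction(speedup: float, n: int) -> float:
    """Serial fraction r_s implied by a measured speedup on n processors.

    Derived by rearranging s = 1 / (r_s + (1 - r_s) / n):
        r_s = (n / s - 1) / (n - 1)
    """
    if n <= 1:
        raise ValueError("at least two processors are needed to estimate r_s")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (n / speedup - 1.0) / (n - 1.0)
```

A measured speedup equal to n gives r_s = 0, and the estimate grows as the measured curve falls further below the ideal line.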

The  throughput  pi  of  the  MODIS  simulation  when  implemented  using  a  single  processor  was 
6.95  X  10^  rays  per  second.  Although  the  simulation  ran  very  quickly  using  the  Cray’s  conventional 
processors,  it  did  not  utilize  any  of  the  supercomputer’s  FPGA  processing  capabilities  [8]. 

Cameron implemented the entire MODIS simulation using the Cray XD-1's Advanced Micro Devices Opteron 275 processors. However, such a complex system was not likely to fit on a single FPGA. Instead, only the first two steps of the simulation were implemented on FPGAs in this project. The first step was the calculation of the point where a ray intersected a conicoid surface. The second step was the calculation of the vector normal to the surface at the point of intersection. These steps are referred to as the ray-intersection calculation and the normal-vector calculation, respectively, for the rest of this report. These two calculations are further described in section 4.3 on page 26.

3.3  The  Trident  Compiler 

This project initially selected the Trident compiler for implementation of the normal-vector calculation. The compiler uses a novel approach to convert sequential C code into hardware description language code. It first compiles the C code into an intermediate representation. It then parses this representation to automatically extract parallelism. Finally, it schedules operations and produces VHDL code based on its analysis of that parallelism.

According  to  its  documentation,  the  Trident  compiler  was  designed  to  support  ASAP,  ALAP, 
and  modulo  scheduling  (see  section  2.5  on  page  15  for  a  description  of  these  methods)  among 
other  scheduling  methods  [24].  To  test  these  claims,  the  source  code  of  the  compiler  was  modified 
to  implement  floating-point  arithmetic  using  Floating-Point  Operator  v2.0,  packaged  with  Xilinx 
Integrated  Synthesis  Environment  (ISE)  version  8.1i. 

Using this modified version of Trident, a simple single-precision floating-point multiplier was implemented and its functional correctness was verified. However, the scheduling technique being implemented could not be confirmed, since only one arithmetic operation was implemented. Subsequently, the normal-vector calculation was implemented using Trident. However, a simulation of the resulting VHDL code showed anomalies in scheduling. The conclusion was that modulo scheduling was not implemented.

Without  proper  scheduling,  the  VHDL  generated  by  Trident  could  not  have  used  the  resources 
of  the  FPGA  adequately.  Attempts  to  contact  the  developers  of  Trident  to  address  these  issues  were 
unsuccessful.  As  a  result,  the  focus  of  this  project  shifted  to  a  different  tool — Mitrion-C. 


3.4  Previous  Use  of  Mitrion-C 

High-level abstraction tools for hardware design serve two purposes: (1) they simplify the expression of large, complex algorithms, and (2) they simplify hardware design for researchers familiar only with software development. Mitrion-C is a high-level language that is part of the Mitrion Integrated Development Environment (IDE). The functions of both Mitrion-C and the Mitrion IDE are further discussed in section 4.1 on the next page.

Because of its relatively recent development, little work has been done using Mitrion-C. All of the significant related work was published in 2007. Koo et al. compared FPGA performance using Mitrion-C to a software implementation using ANSI-C on the Silicon Graphics, Inc. (SGI) Reconfigurable Application Specific Computing (RASC) RC100 platform, using four Virtex-4 LX200 FPGAs. In the case of an MRI brain scan analysis algorithm, overall speedup was 3.6×, but the speedup of the portion implemented using FPGAs was 11.6× [25].

Koo et al. also used Mitrion-C to implement two other algorithms: the first was a floating-point dense matrix-vector multiplication and the second was an algorithm to simulate solvating protein in water. Comparing the implementation on a single FPGA versus implementation on a single 1.5 GHz Itanium 2 sequential processor, maximum speedup for the first algorithm was 21× and for the second was 10×. Speedup was also shown to increase significantly when using multiple FPGAs [26].

Kindratenko et al. measured speedup comparing the performance of two SGI RC100 FPGAs to that of two 1.4 GHz Intel Itanium 2 sequential processors. Mitrion-C was used to generate the FPGA hardware design. The algorithm concerned the calculation of the two-point correlation function, used to analyze the clustering of extragalactic objects [27]. In the best-case scenario, speedup was measured to be 9.5× [28]. In each of the three reports mentioned above, resource consumption on the target FPGAs was reported and varied from case to case. No correlation between resource consumption and speedup was drawn.

Speedup using FPGAs was verified by several independent projects, but the benefit of using high-level languages over VHDL or other traditional HDL techniques was not addressed in these reports. El-Araby et al. sought to quantify the "comparative productivity" of various high-level abstraction tools compared with traditional HDL design. Mitrion-C ranked poorly according to the metrics of efficiency and ease-of-use and was also only able to achieve about 60% of the throughput of a manually-coded VHDL solution. However, none of the high-level tools proved to be clearly superior in each of the four applications implemented [29].

Most  research  using  Mitrion-C  thus  far  has  focused  on  speedup.  No  reports  have  quantified  the 
relationship  between  speedup  and  power  consumption.  However,  Mitrionics  AB  has  repeatedly 
marketed  the  Mitrion  platform  as  a  low-power  solution  [30]. 
Chapter 4

Implementation 

4.1  The  Mitrion  Platform 


Figure 4.1: Hardware design flow. (Inputs shown: hardware design language or high-level abstraction tool; device constraints.)

The  high-level  language  Mitrion-C  is  part  of  the  Mitrion  Integrated  Development  Environment 
(IDE).  It  is  a  “C-like”  language  in  that  it  uses  syntax  similar  to  that  of  the  American  National 
Standards  Institute’s  standard  for  C  (ANSI-C).  Mitrion-C  gives  designers  the  ability  to  focus  on  the 
logic  of  an  algorithm  rather  than  hardware  specifics.  Parallelism  is  expressed  using  data  structures 
and  loop  constructs.  This  system  gives  Mitrion-C  the  feel  of  a  software  language,  but  allows 
explicit expression of parallelism as well. The Mitrion IDE converts Mitrion-C programs into VHDL, which can then be synthesized and placed-and-routed using the Xilinx ISE [31].

Generic hardware design flow was first discussed in section 2.2 on page 10. Figure 2.2 on page 11 is reproduced here as figure 4.1 for convenience. Figure 4.2 illustrates the hardware design flow specific to the implementation using the Mitrion IDE. On the surface, it appears that little difference exists between the two illustrations. In fact, all the same steps are present, since hardware design using Mitrion-C is still hardware design, despite the user interface allowing a higher level of abstraction. However, the Mitrion IDE does offer some benefits over hardware design using VHDL.

Figure 4.2: Mitrion-C Design Flow.

Synthesis  of  VHDL  code  can  take  several  minutes  to  complete  and  functional  simulation  is 
only  possible  after  code  has  been  synthesized.  In  addition,  synthesis  will  complete  successfully 
even  if  it  is  unlikely  a  design  will  fit  on  the  target  FPGA.  In  contrast,  the  Mitrion  IDE  includes 
a  functional  simulator  that  produces  a  graphic  representation  of  an  algorithm  in  moments.  This 
simulator  will  also  estimate  the  amount  of  resources  a  design  is  likely  to  require. 

In sum, the Mitrion IDE allows the user to spend the majority of development time iterating between code development and simulation, as shown by the small loop in figure 4.2. No feedback is needed after synthesis because the Mitrion-C compiler creates the VHDL code automatically. The feedback line after place-and-route is dashed to show that the Mitrion-C compiler automatically checks that a design will most likely fit on the target hardware. This check significantly reduces the number of failures at the place-and-route step.

Mitrion-C and the Mitrion IDE are both part of the Mitrion Software Development Kit (SDK). The Mitrion SDK also contains a graphical simulation tool and libraries that allow a host computer to interface with the Mitrion Virtual Processor (MVP). This processor is the core of the Mitrion Platform. It is a reconfigurable software architecture that runs Mitrion-C code. For each unique application, the Mitrion SDK creates a new virtual processor tailored to the targeted FPGA and optimized for the application [31]. Mitrion-C code defines the customization of each Mitrion Virtual Processor. The use of this intermediary step explains how the user is able to quickly simulate and debug Mitrion-C code: the same virtual processor that can be simulated with the simulator included in the Mitrion SDK can also be loaded directly onto an FPGA.
Figure 4.3: Cray XD-1 architecture.

4.2 The Cray XD-1 Architecture

The Cray XD-1 supercomputer is designed to permit high-performance reconfigurable computing (HPRC). Figure 4.3 illustrates the components most significant to this report. Each chassis of a Cray XD-1 contains up to 12 AMD Opteron 275 sequential processors, distributed either two or four to each compute node. The research collected in this report used nodes containing two Opterons.

Figure 4.4 illustrates the memory connections of the Cray XD-1. The Opterons have access to host memory, Synchronous Dynamic Random Access Memory (SDRAM) like that found in most personal computers. The Virtex-II Pro FPGAs, however, do not have direct access to this memory. Connected to the FPGAs are four banks of Quad-Data Rate (QDR) Static Random Access Memories (SRAMs), a cache to which the FPGAs have direct access. Each QDR SRAM holds 4 MB of memory, for a total cache size of 16 MB.

Each memory bus is 64 bits wide, capable of transmitting two 32-bit single-precision floating-point numbers at once. The QDR SRAMs are only accessible to the Opterons through the RapidArray Interconnect System, which is designed to allow communication between FPGA and Opteron with high bandwidth and low latency. The RapidArray Interconnect System also allows the processors of each computing node to communicate with other nodes, but this feature was not used in this project [1].

Figure 4.4: Cray XD-1 memory connections.

Although it would seem that access to the QDR SRAMs would require complex C code on the Opterons, Cray created an interface that simplifies the process. The user is given functions that map the Opteron's virtual address space directly into the QDR SRAMs. Therefore, the programmer only needs to interact with normal ANSI-C arrays to write to and read from the FPGA's memory. Figure 4.5 illustrates how the programmer can think of the interaction between Opteron and Virtex-II Pro.


4.3  Description  of  the  Calculations  Implemented 

As discussed in section 3.2 on page 19, two steps of the MODIS simulation, the ray-intersection calculation and the normal-vector calculation, were implemented in this project. The inputs to the ray-intersection calculation were the point of origin $p_0$ (expressed using the Cartesian coordinates $x_0$, $y_0$, and $z_0$), the initial direction vector $\hat{a}_0$ (given as the direction cosines L, M, and N), the curvature c of the conicoid, and the conic constant k, which depends upon the conicoid's type, as shown in table 4.1 [32].

The output was the point of intersection $p_1$ (given as $x_1$, $y_1$, and $z_1$). Figure 4.6 on page 28 illustrates the calculation. In the figure, the conic surface represents a cross-section of a conicoid. All the points in this cross-section have coordinates with x = 0.

The equations with which $p_1$ is calculated are listed in section 4.4.2 on page 32. Once the intersection of a ray with a conic surface has been found, it can be used as the originating point $p_0$ for the next interaction with an optical element. However, the direction of the resultant ray must first be found. Figure 4.6 on page 28 also illustrates this calculation.
Figure 4.5: Data flow between host and FPGA programs.


Table 4.1: Conic constants and conicoid types.

Conic Constant    Conicoid
k > 0             oblate ellipsoid
k = 0             sphere
-1 < k < 0        prolate ellipsoid
k = -1            paraboloid
k < -1            hyperboloid
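
The classification in table 4.1 can be expressed as a small lookup function. The sketch below is illustrative only; the function name is hypothetical and exact comparison with 0 and -1 is assumed to be acceptable because these constants are normally specified exactly.

```python
def conicoid_type(k: float) -> str:
    """Return the conicoid type for a conic constant k (see table 4.1)."""
    if k > 0:
        return "oblate ellipsoid"
    if k == 0:
        return "sphere"
    if k > -1:
        return "prolate ellipsoid"
    if k == -1:
        return "paraboloid"
    return "hyperboloid"
```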
Figure 4.6: Interaction of an incident ray with a conic surface.

After interaction with an optical element, a ray may either be reflected or refracted. The calculation of the ray's path after either reflection or refraction depends on the vector $\hat{a}_N$ normal to the conic surface at the point of intersection $p_1$. In the case of reflection, the angle $\theta_1$ the original ray makes with $\hat{a}_N$ is equal to the angle between $\hat{a}_N$ and the unit vector of the resultant ray, $\hat{a}_{reflected}$.

In the case of refraction, the direction of the resultant ray $\hat{a}_{refracted}$ can be found by solving Snell's equation

$$ n_1 \sin\theta_1 = n_2 \sin\theta_2 \tag{4.1a} $$

or

$$ \theta_2 = \arcsin\frac{n_1 \sin\theta_1}{n_2} \tag{4.2a} $$

where both $\theta_1$ and $\theta_2$ are given relative to $\hat{a}_N$. The calculation of $\hat{a}_N$ depends on the x and y coordinates at the point of intersection. These coordinates result from the ray-intersection calculation and are provided as inputs to the normal-vector calculation. The approach taken with this problem was to use a different coordinate system with its origin at the vertex of each optical element. Therefore, the origin (0, 0, 0) represents the vertex of the conicoid.
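
Snell's equation, in the form θ2 = arcsin(n1 sin θ1 / n2), is straightforward to evaluate numerically. The following Python sketch is an illustration, not part of the MODIS code; the function name is hypothetical, angles are in radians, and returning None for total internal reflection is a design choice made for this example.

```python
import math
from typing import Optional


def refraction_angle(n1: float, n2: float, theta1: float) -> Optional[float]:
    """Angle of the refracted ray, measured from the surface normal.

    n1, n2 : refractive indices on the incident and refracted sides
    theta1 : angle of incidence in radians, measured from the normal
    Returns None when no refracted ray exists (total internal reflection).
    """
    sin_theta2 = n1 * math.sin(theta1) / n2
    if abs(sin_theta2) > 1.0:
        return None
    return math.asin(sin_theta2)
```

For example, a ray entering a medium with index 1.5 from a medium with index 1.0 at 30 degrees bends toward the normal, to arcsin(1/3), roughly 19.5 degrees.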
The normal-vector calculation takes as inputs $x_1$, $y_1$, curvature c, and u, a parameter derived from c and the conic constant k. The variable $z_1$ is not needed to calculate the normal vector. The outputs of the calculation are the three components of the resultant normal vector $\hat{a}_N$ (f, fdx, and fdy). The remaining equations of the normal-vector calculation are listed in section 4.4.1.

4.4  The  FPGA  Design 

4.4.1 Implementation of the Normal-Vector Calculation

The hardware designs loaded onto the Virtex-II Pro FPGAs were created using Mitrion-C. The Mitrion-C programs for both the ray-intersection calculation and the normal-vector calculation used four functions. The normal-vector calculation is discussed first because it is functionally simpler. The source code can be found in appendix B on page 55.

The  optical  ray-tracing  procedure  of  the  normal-vector  calculation  was  explained  in  section  4.3 
on  page  26.  For  the  purposes  of  implementation,  the  process  simplifies  into  a  system  of  arithmetic 
operations,  represented  here: 


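$$
\begin{aligned}
v &= u(x^2 + y^2) && (4.3a) \\
a &= \sqrt{1 - v} && (4.3b) \\
p &= 1 + a && (4.3c) \\
q &= ap && (4.3d) \\
r &= pq && (4.3e) \\
s &= 2q && (4.3f) \\
w &= c/r && (4.3g) \\
b &= w(s + v) && (4.3h) \\
dx &= bx && (4.3i) \\
dy &= by && (4.3j) \\
e &= \sqrt{dx^2 + dy^2 + 1} && (4.3k) \\
f &= 1/e && (4.3l) \\
fdx &= f \cdot dx && (4.3m) \\
fdy &= f \cdot dy && (4.3n) \\
\hat{a}_N &= (fdx, fdy, f) && (4.3o)
\end{aligned}
$$
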

The  problem  requires  5  additions,  1  subtraction,  13  multiplications,  2  divisions,  and  2  square 
roots.  The  addition  of  constants  was  implemented  as  a  floating-point  operation  to  ensure  precision. 
The  numbers  of  floating-point  units  implemented  that  are  reported  here  reflect  the  output  of  the 
simulator  packaged  with  Mitrion-C. 
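
For reference, the normal-vector equations can be transcribed almost line for line into software. The Python sketch below is not the Mitrion-C program from appendix B; it uses double precision rather than single-precision hardware units, the function name is hypothetical, and the equation forms in the comments follow the reading v = u(x² + y²), a = √(1 − v), b = w(s + v), which is the one consistent with the operation counts quoted above.

```python
import math


def normal_vector(x: float, y: float, c: float, u: float):
    """Unit normal (fdx, fdy, f) at the point (x, y) on the conicoid.

    c is the curvature and u is the parameter derived from c and the
    conic constant k, as described in section 4.3.
    """
    v = u * (x * x + y * y)                 # (4.3a)
    a = math.sqrt(1.0 - v)                  # (4.3b)
    p = 1.0 + a                             # (4.3c)
    q = a * p                               # (4.3d)
    r = p * q                               # (4.3e)
    s = 2.0 * q                             # (4.3f)
    w = c / r                               # (4.3g)
    b = w * (s + v)                         # (4.3h)
    dx = b * x                              # (4.3i)
    dy = b * y                              # (4.3j)
    e = math.sqrt(dx * dx + dy * dy + 1.0)  # (4.3k)
    f = 1.0 / e                             # (4.3l)
    return (f * dx, f * dy, f)              # (4.3m)-(4.3o)
```

Because f = 1/e and e² = dx² + dy² + 1, the returned vector always has unit length, which gives a convenient sanity check on any implementation.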

The read_inputs() function reads one 64-bit word (or string of bits) from each of two QDR SRAMs each computational cycle. Since only four inputs were needed for the normal-vector calculation, QDR SRAMs 0 and 1 were used for input. The reading of inputs from the QDR SRAMs assumes that data is available, that is, that appropriately formatted floating-point inputs have been stored in the memory in the order to be read. This requirement must be satisfied by the host program run on the Opteron. The host program source code for the normal-vector calculation can be found in appendix D on page 67.

Figure 4.7: Detailed data flow between normal-vector calculation host and FPGA programs.

The host code associated an ANSI-C array with memory addresses on the QDR SRAMs using the function mitrion_processor_reg_buffer. These virtual buffers must be declared with a data type, and the buffers' memory addresses (as they appear to the host program) are determined by the size of that data type. Since ANSI-C natively supports 32-bit single-precision floating-point representation as floats, the buffers were simply declared as floats.

Data was written by the host program as two 32-bit floats per memory. However, in Mitrion-C the QDR SRAMs may only be read one 64-bit word at a time. The Mitrion-C program was written to read four 32-bit floats as two 64-bit words, split those words into four 32-bit words, and then associate those words with the floating-point format in Mitrion-C. Only then could floating-point arithmetic be correctly implemented on the original four floats. Figure 4.7 illustrates this process, using the normal-vector calculation as an example.
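
The packing and splitting of words can be mimicked on a host machine with Python's struct module. The sketch below is an analogy, not the Mitrion-C code; the helper names are hypothetical, and placing the first float in the low 32 bits of the word is an assumption, since the actual ordering on the Cray XD-1 is not stated here.

```python
import struct


def pack_floats(lo: float, hi: float) -> int:
    """Pack two single-precision floats into one 64-bit word.

    Assumption: `lo` occupies bits 0-31 and `hi` occupies bits 32-63.
    """
    (lo_bits,) = struct.unpack("<I", struct.pack("<f", lo))
    (hi_bits,) = struct.unpack("<I", struct.pack("<f", hi))
    return (hi_bits << 32) | lo_bits


def unpack_floats(word: int):
    """Split a 64-bit word into two 32-bit words and reinterpret each
    as an IEEE 754 single-precision float."""
    lo_bits = word & 0xFFFFFFFF
    hi_bits = (word >> 32) & 0xFFFFFFFF
    (lo,) = struct.unpack("<f", struct.pack("<I", lo_bits))
    (hi,) = struct.unpack("<f", struct.pack("<I", hi_bits))
    return lo, hi
```

The key point is that the split produces raw 32-bit patterns; only after they are reinterpreted as floats can floating-point arithmetic be applied to them.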

Arithmetic was implemented in the calc_outputs() function. Mitrion-C's syntax for arithmetic can be used as if it were sequential ANSI-C. The Mitrion compiler removes the need for the programmer to understand scheduling, floating-point units, or other hardware considerations.

The outputs of calc_outputs() were fed to the write_outputs() function, which wrote the floating-point numbers it received, two per 64-bit word, to a QDR SRAM. It first had to pack the floating-point numbers into 64-bit words, reversing the process implemented in read_inputs(). Since only three outputs were generated, the floating-point value 0.0 was written into the second half of QDR SRAM 3 because a single 32-bit output could not be written on its own into a 64-bit space. This choice made no difference to throughput since it always takes one computational cycle to perform a memory write, regardless of the value of the actual data.

Figure 4.8: Mitrion-C simulation of normal-vector calculation.

The three functions read_inputs(), calc_outputs(), and write_outputs() were controlled by the main() function. The three functions were implemented in foreach loops. In Mitrion-C, when this type of loop is implemented over a list of values, the compiler automatically pipelines the code within and executes it in parallel. Further explanation of pipelining and scheduling is provided in section 2.5 on page 15. Figure 4.8 highlights the data dependencies in the normal-vector implementation. Data flows from top to bottom in the figure, but data can be transferred as soon as it is read; that is, the calc_outputs() function does not need to wait for every sample to be read by read_inputs() before it can begin arithmetic on the inputs.
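
The streaming behaviour described above, where each stage consumes a sample as soon as the previous stage produces it, has a loose software analogy in Python generators. The sketch below is only an analogy: Python runs the stages one at a time rather than concurrently in hardware, the addition stands in for the real arithmetic, and the `log` argument is a hypothetical device for showing the order of events.

```python
def read_inputs(samples, log):
    """Yield one sample at a time, recording when each is read."""
    for i, sample in enumerate(samples):
        log.append(f"read {i}")
        yield sample


def calc_outputs(inputs, log):
    """Process each sample as soon as it arrives (placeholder arithmetic)."""
    for i, (x, y) in enumerate(inputs):
        log.append(f"calc {i}")
        yield x + y


def write_outputs(results, log):
    """Collect results, recording when each is written."""
    out = []
    for i, value in enumerate(results):
        log.append(f"write {i}")
        out.append(value)
    return out


def run(samples):
    """Chain the three stages in the same order as the main() function."""
    log = []
    out = write_outputs(calc_outputs(read_inputs(samples, log), log), log)
    return out, log
```

Running `run([(1, 2), (3, 4)])` shows sample 0 being read, calculated, and written before sample 1 is read, so no stage waits for the whole data set.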

As  mentioned  before,  the  size  of  each  QDR  SRAM  available  to  the  FPGA  was  4  MB.  There¬ 
fore,  the  number  samples  that  could  be  loaded  into  the  memory  can  be  found  using: 


Samples  =  2  SRAMs  x 


4  MB 
SRAM 


X 


1048  576  bits  1  float 

MB  ^  32  bits 


1  sample 
4  floats 


528244  =  (2^^)  (4.4a) 


The FPGA program was looped by the host program 2048 ($2^{11}$) times, for a total of
$2^{11} \times 2^{19} = 1\,073\,741\,824$ ($2^{30}$) samples, in order to simulate the calculation of a
large dataset. The data were generated by the host program and reflected realistic MODIS simulation
inputs. Elapsed time was measured using the clock() function of ANSI-C.
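The capacity arithmetic of equation 4.4a can be checked with a few lines of C. This is a minimal sketch, not code from the appendices: the names samples_per_load() and total_samples() are hypothetical, and a megabyte is assumed to be 1 048 576 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per megabyte, using the binary convention (2^20) assumed in equation 4.4a. */
#define BYTES_PER_MB 1048576u

/* Number of samples that fit in the input memories.
 *   num_srams         memories used for input (2 in the text)
 *   mb_per_sram       capacity of each memory in MB (4 in the text)
 *   floats_per_sample 32-bit inputs per sample (4 for the normal-vector calculation) */
uint64_t samples_per_load(uint64_t num_srams, uint64_t mb_per_sram,
                          uint64_t floats_per_sample)
{
    const uint64_t bytes_per_float = 4; /* IEEE 754 single precision */
    assert(floats_per_sample > 0);
    uint64_t total_bytes = num_srams * mb_per_sram * BYTES_PER_MB;
    return total_bytes / (bytes_per_float * floats_per_sample);
}

/* Total samples processed when the host repeats the FPGA run `iterations` times. */
uint64_t total_samples(uint64_t samples_per_run, uint64_t iterations)
{
    return samples_per_run * iterations;
}
```

Calling samples_per_load(2, 4, 4) reproduces the 524 288 samples of the normal-vector case, and total_samples(524288, 2048) gives the 1 073 741 824 samples quoted above.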
Figure 4.9: Mitrion-C simulation of ray-intersection calculation.

4.4.2  Implementation  of  the  Ray-Intersection  Calculation 

The primary difference between the ray-intersection calculation and the normal-vector calculation
was that the former requires eight inputs rather than four.

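The ray-intersection calculation can be simplified into a system of floating-point equations,
shown below.

\begin{align}
g &= N - c\,\bigl(x_0 L + y_0 M + (k + 1.0)\,z_0 N\bigr) \tag{4.5a} \\
h &= c\,\bigl(x_0^2 + y_0^2 + (k + 1.0)\,z_0^2\bigr) - 2 z_0 \tag{4.5b} \\
f &= c\,\bigl(1 + k N^2\bigr) \tag{4.5c} \\
u &= \frac{h}{g + \sqrt{g^2 - f h}} \tag{4.5d} \\
x_1 &= u L + x_0 \tag{4.5e} \\
y_1 &= u M + y_0 \tag{4.5f} \\
z_1 &= u N + z_0 \tag{4.5g}
\end{align}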

This problem therefore requires 11 floating-point additions, 3 subtractions, 19 multiplications,
1 division, and 1 square root. The addition of constants was implemented as a floating-point
operation to ensure precision. The numbers of floating-point units reported here reflect the
output of the simulator packaged with Mitrion-C.
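The text does not show where equations 4.5a through 4.5d come from. One plausible reading, offered as an assumption rather than as the author's derivation, is that the surface is the standard conic of revolution $c\,(x^2 + y^2 + (k+1)z^2) - 2z = 0$ with vertex curvature $c$ and conic constant $k$, and that the ray starts at $(x_0, y_0, z_0)$ with unit direction cosines $(L, M, N)$, so that $L^2 + M^2 + N^2 = 1$. Substituting the point $(x_0 + uL,\ y_0 + uM,\ z_0 + uN)$ into the surface equation then gives a quadratic in the path length $u$:

```latex
\begin{align*}
0 &= c\left[(x_0 + uL)^2 + (y_0 + uM)^2 + (k+1)(z_0 + uN)^2\right] - 2\,(z_0 + uN) \\
  &= \underbrace{c\,(1 + kN^2)}_{f}\,u^2
     \;-\; 2\,\underbrace{\left[N - c\,\bigl(x_0 L + y_0 M + (k+1)\,z_0 N\bigr)\right]}_{g}\,u
     \;+\; \underbrace{c\,\bigl(x_0^2 + y_0^2 + (k+1)\,z_0^2\bigr) - 2 z_0}_{h}, \\[4pt]
u &= \frac{g - \sqrt{g^2 - f h}}{f} \;=\; \frac{h}{g + \sqrt{g^2 - f h}} .
\end{align*}
```

Under that assumption, equation 4.5d is the smaller root of the quadratic written with the radical in the denominator, a form that remains finite as $c$ (and therefore $f$) approaches zero. Counting the operations in equations 4.5a through 4.5g as written, with $k + 1.0$ evaluated in both 4.5a and 4.5b, gives 19 multiplications, 11 additions, 3 subtractions, 1 division, and 1 square root, which agrees with the totals quoted above.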

The FPGA program is listed in appendix A on page 50. Since the FPGA’s QDR SRAMs can
only hold 4 MB of data each, the number of samples that could be loaded into two memories was
half the number that could be stored for the normal-vector calculation (see equation 4.4a). Since
the ray-intersection calculation requires eight inputs, consisting of 8 × 32 = 256 bits, it would
be impossible to read all eight inputs from just two 64-bit memories (a total of 128 bits) in one
computational cycle. There were two options to address this issue: either use four memories for
input, or continue to use two SRAMs but allow more computational time to read the inputs.

Mitrion-C has built-in functions that ensure that a read or write of a memory is not attempted
until that memory reports that it is in the ready state. Therefore, it should have been possible to
read eight inputs from four memories and simply write the three outputs to QDR SRAMs 2 and
3, overwriting the original inputs. This approach might have worked, but continuing to use only
two memories presented a chance to analyze the effects of large numbers of inputs on FPGA
processing. There are undoubtedly many applications that require more inputs than even four
memories could provide in a single computational cycle. Using two computational cycles would
provide valuable insight into the effects on throughput in such a case. Therefore, the eight inputs
were stored in QDR SRAMs 0 and 1 in a staggered fashion and the Mitrion-C function
read_inputs() required two computational cycles to execute. The effects on throughput are
discussed in section 5.2 on page 38.
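The bandwidth argument above reduces to a ceiling division, sketched below in C. The function name read_cycles() is hypothetical, and the sketch assumes that every memory delivers exactly one 64-bit word per computational cycle, as the text implies.

```c
#include <assert.h>

#define BITS_PER_FLOAT     32u /* IEEE 754 single precision */
#define BITS_PER_SRAM_WORD 64u /* width of one QDR SRAM word, per the text */

/* Minimum computational cycles needed to read one sample's inputs,
 * assuming each memory supplies one 64-bit word per cycle. */
unsigned read_cycles(unsigned num_inputs, unsigned num_memories)
{
    assert(num_memories > 0);
    unsigned bits_needed    = num_inputs * BITS_PER_FLOAT;
    unsigned bits_per_cycle = num_memories * BITS_PER_SRAM_WORD;
    /* Ceiling division: a partially filled cycle still costs a full cycle. */
    return (bits_needed + bits_per_cycle - 1u) / bits_per_cycle;
}
```

With these assumptions, four inputs from two memories need one cycle, eight inputs from two memories need two cycles, and eight inputs from four memories would again need only one.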

The Mitrion simulation of the ray-intersection calculation is presented in figure 4.9 on the
preceding page. The notable difference between this figure and figure 4.8 on page 31 is that
the read_inputs() function of the ray-intersection calculation is connected to calc_outputs()
with eight data buses rather than four, which signifies that twice as many inputs are passed
from one function to the next. The fact that the ray-intersection calculation is encapsulated in a
single foreach loop rather than one loop for each function has no functional significance. The
organization of the ray-intersection calculation’s host program did not differ significantly from
that of the normal-vector calculation’s host program. The host program code is listed in
appendix C on page 60.

4.5  The  Sequential  Program 

This project set out to compare throughput and power consumption between FPGAs and traditional
processors. ANSI-C running on the Opteron 275s was used as the traditional implementation in
this project. The GNU C compiler was used to compile the code, and the flag -O was used because
it instructs the compiler to turn on forms of optimization that do not require any trade-off between
speed and space.

The code of the sequential implementations of the ray-intersection calculation and the
normal-vector calculation is listed in appendix E on page 74 and appendix F on page 79,
respectively. The sequential implementations were designed to mirror the host program
implementations as much as possible. Data generation was identical. Time measurements using the
clock() function covered only the time during which actual arithmetic calculations were made. For
the sequential programs as well as the host programs, the number of samples calculated and the
number of times the calculation portion of the program was looped varied in order to ensure that,
in each case, the time required to process $2^{30}$ samples was measured.

Figure 4.10 on the next page contrasts the dataflow of the sequential implementation with that of
the FPGA implementation, which uses both an FPGA design and an ANSI-C host program (see
figure 4.5 on page 27). The dataflow of the sequential program is clearly much simpler, since
the sequential program only interacts with the host memory, while the host program of the FPGA
implementation must move data between the host memory and the FPGA memory. However,
the delays incurred by data transfers in the FPGA implementation are negligible compared to
overall processing time.


Figure 4.10: Data flow in sequential program. (Diagram labeled "ANSI-C Sequential Program".)
Chapter 5
Results 


This project sought to measure the cost versus benefit of using reconfigurable rather than
conventional processors for HPC. The primary concern for most users of HPC is throughput. Most
research comparing FPGAs to CPUs focuses on this aspect of performance. However, modern HPC
systems consume significant amounts of power and generate significant heat. Overheating can even
force a system to shut off or reset, which may cause researchers to lose data or have to reprogram
experiments. Power and heat are also important to field applications of FPGA processing, where
available power and cooling capacity may be limited.

Throughput was measured using the clock() function of ANSI-C. The measurements were
straightforward and are further discussed in section 5.2 on page 38. Power and heat measurements
were taken using the Cray XD1’s built-in Active Management System (see section 4.3 on page 25),
a monitoring and control tool for system administrators and end users [33]. This tool was able to
isolate and accurately measure power delivered to both the Opteron and FPGA, but the resulting
heat could not be isolated with the instrumentation provided with the system. As a result, the
cost-versus-benefit analysis in this project compared power to throughput. Resource consumption
by the FPGA implementations was also measured but was not part of the cost-benefit analysis, as
explained below, although it would be a suitable focus of further research. In addition, no attempt
was made to measure the time taken to implement the calculations, because any such measurement
would have required a skilled user of Mitrion-C.


5.1  Resource  Consumption 

One  cost  commonly  associated  with  the  use  of  FPGAs  is  resource  consumption.  Since  FPGAs 
are  reconfigurable,  each  design  that  is  implemented  on  one  uses  different  onboard  components  to 
implement  logic.  One  part  of  hardware  design,  discussed  in  section  2.2  on  page  10,  is  the  loop  back 
to  code  design  after  synthesis.  VHDL  programmers  want  to  maximize  the  resource  consumption 
of  their  designs  in  order  to  achieve  maximum  throughput  given  a  set  of  hardware  constraints.  The 
synthesis  step  of  hardware  design  produces  a  detailed  analysis  of  expected  resource  allocation, 
which  the  hardware  designer  uses  to  make  changes  to  a  design. 
Figure 5.1: Mitrion-C simulation of calc_outputs() function of ray-intersection simulation.

Table 5.1: Ray-intersection resource consumption.

Resource (total)            Implemented (Percent)
Slices (23 616)             19 044 (81%)
Flip Flops (47 232)         26 508 (56%)
4-input LUTs (47 232)       26 250 (56%)
Block RAMs (232)            25 (11%)
Multipliers (232 18x18)     72 (31%)

Table 5.2: Normal-vector resource consumption.

Resource (total)            Implemented (Percent)
Slices (23 616)             16 571 (70%)
Flip Flops (47 232)         21 670 (46%)
4-input LUTs (47 232)       20 466 (43%)
Block RAMs (232)            23 (10%)
Multipliers (232 18x18)     48 (21%)

Although Mitrion-C allows explicit definition of parallelism, it does not permit the same
fine-grained control over a design’s resource use as a traditional HDL does. Section 2.5 on page 15
describes the many considerations that must be taken into account when a hardware designer
creates a design. In Mitrion-C, scheduling is automated and opaque to the user. Mitrion’s simulator
shows when arithmetic operations are begun, giving an indication of how scheduling is done. In
figure 5.1 on the previous page, each interval of space in the vertical direction corresponds to a
unit of time, that is, one computational cycle. Analysis of this flow shows that the Mitrion
compiler uses As-Late-As-Possible (ALAP) scheduling, described in section 2.5 on page 15 and
illustrated in figure 2.5 on page 15. However, the simulation output also seems to indicate that,
with the exception of division, every arithmetic operator can be completed in just one
computational cycle. Experience with other floating-point operator implementations makes it seem
unlikely that this is actually the case.

The simulator output also shows no use of modulo scheduling, described in section 2.5 on
page 15. Implementation of modulo scheduling would lead to the number of floating-point units
implemented being based on need. However, the simulation output indicates that one floating-point
operator was implemented for each operation completed. It seems highly unlikely that the Mitrion
Virtual Processor could generate one output per computational cycle for a problem as complex as
the ray-intersection calculation using the schedule that is illustrated in figure 5.1 on the
previous page. It seems more probable that the simulator merely produces a simple ALAP-scheduled
representation of the logic for debugging purposes, and that more complex scheduling and
optimization are not visible to the user. The VHDL code the Mitrion compiler generates is
complex and not designed to be analyzed by the end user. Attempts to do so were unsuccessful.
Table 5.3: Ray-intersection throughput measurements.

                       Opteron 275        Virtex-II Pro
Rays Traced            1 073 741 824      1 073 741 824
Time (s)               219.54             21.49
Throughput (rays/s)    4.891 × 10^6       4.996 × 10^7
Speedup                -                  10.21×
                       -                  921%

Tables 5.1 and 5.2 on the previous page report the resources consumed by the ray-intersection
and normal-vector calculations, respectively. Xilinx defines slices as the basic configurable
logic unit within an FPGA. Each one contains two 4-input lookup tables (LUTs) and two flip-flops.
The lookup tables are used to implement simple logic equations and the flip-flops are used to hold
the outputs of the lookup tables when that is required. Both designs used well over 50% of the
available slices. It was therefore not possible using Mitrion-C to implement multiple instances of
the designs in order to obtain greater throughput, because doing so would have required twice as
many slices.
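The replication argument can be illustrated with integer arithmetic on the slice counts in tables 5.1 and 5.2. The sketch below is only a first-order bound under the assumption that slices are the single limiting resource; routing, timing, and memory ports are ignored, and the function names are hypothetical.

```c
#include <assert.h>

/* Percentage of a resource consumed, rounded to the nearest whole percent. */
unsigned percent_used(unsigned used, unsigned total)
{
    assert(total > 0);
    return (100u * used + total / 2u) / total;
}

/* Upper bound on identical design copies that fit, judged by one resource only. */
unsigned max_instances(unsigned used, unsigned total)
{
    assert(used > 0);
    return total / used;
}
```

For the ray-intersection design, percent_used(19044, 23616) returns 81 and max_instances(19044, 23616) returns 1. The normal-vector design (16 571 slices) also yields a single instance.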


5.2  Throughput  Measurement 

As discussed in section 4.4 on page 29, the FPGA implementation of the ray-intersection
calculation could process 262 144 ($2^{18}$) samples before needing new inputs, while the
normal-vector calculation could process 524 288 ($2^{19}$) samples. The ray-intersection calculation
was iterated by the host program 4096 ($2^{12}$) times and the normal-vector calculation 2048
($2^{11}$) times. Each time measurement therefore measured the time needed to process
1 073 741 824 = $2^{30}$ rays.

The time functions built into ANSI-C were used for time measurement. The clock() function
returns the processor time used by the program, given in clock ticks relative to an arbitrary
reference point. The time was measured before starting the FPGA program and again once all
iterations were complete. By subtracting the start value from the end value and dividing by the
macro CLOCKS_PER_SEC, which gives the number of clock ticks per second reported by clock(), the
time elapsed in seconds could be calculated. For the sequential implementations of the two
calculations, the time was measured before the loop that encapsulated the arithmetic portion of
the program began and again after it completed. The results of the measurements are presented in
tables 5.3 and 5.4 on the following page.
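The measurement procedure described above can be sketched in C as follows. This is an illustration rather than the code from the appendices; the helper names elapsed_seconds(), throughput(), and speedup() are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Convert two clock() readings into elapsed seconds. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Samples processed per second. */
double throughput(double samples, double seconds)
{
    assert(seconds > 0.0);
    return samples / seconds;
}

/* Speedup of the FPGA run relative to the sequential run. */
double speedup(double t_sequential, double t_fpga)
{
    assert(t_fpga > 0.0);
    return t_sequential / t_fpga;
}
```

In use, clock() would be called immediately before and after the section being timed and the two readings passed to elapsed_seconds(). Because clock() reports processor time rather than wall-clock time, the two agree only if the host program stays busy while the FPGA works, which is an assumption of this sketch.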

The maximum speedup achieved (10.67×) seemed reasonable compared to past research. The
results also seemed to show that the throughput indicated during simulation (that is, one full
calculation delivered per cycle in the case of the normal-vector calculation) was correctly
implemented. The Mitrion Virtual Processor has been shown to run on the Cray XD1 at 100 MHz
[26]. Therefore, the minimum time t required to complete 1 073 741 824 calculations can be found
using the equation:


\[
t = \frac{1\ \text{s}}{100 \times 10^{6}\ \text{clock cycles}} \times \frac{1\ \text{clock cycle}}{\text{calculation}} \times 1\,073\,741\,824\ \text{calculations} = 10.74\ \text{s} \tag{5.1a}
\]

Table 5.4: Normal-vector throughput measurements.

                       Opteron 275        Virtex-II Pro
Rays Traced            1 073 741 824      1 073 741 824
Time (s)               114.79             10.75
Throughput (rays/s)    9.354 × 10^6       9.988 × 10^7
Speedup                -                  10.67×
                       -                  967%


As discussed in section 4.4.2 on page 32, the FPGA implementation of the ray-intersection
calculation required two clock cycles for each calculation. If equation 5.1a is applied to the
ray-intersection calculation with two clock cycles per calculation, the minimum time t equals
21.47 seconds. The expected elapsed times for both the normal-vector and ray-intersection
calculations were essentially equal to the measured values listed in tables 5.4 and 5.3 on the
previous page. This result supports the likelihood that modulo scheduling is automatically
implemented by the Mitrion compiler, because only a schedule with a modulus of one could have
achieved maximum theoretical throughput.
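Equation 5.1a generalizes to any number of cycles per calculation. The C sketch below assumes a fully pipelined design that accepts a new calculation every cycles_per_calc cycles and ignores pipeline fill time and host overhead; min_time_seconds() is a hypothetical name.

```c
#include <assert.h>

/* Lower bound on run time for a fully pipelined design:
 * calculations * cycles per calculation / clock frequency. */
double min_time_seconds(double calculations, double cycles_per_calc, double clock_hz)
{
    assert(clock_hz > 0.0);
    return calculations * cycles_per_calc / clock_hz;
}
```

At 100 MHz, min_time_seconds(1073741824.0, 1.0, 100e6) is about 10.74 s and min_time_seconds(1073741824.0, 2.0, 100e6) is about 21.47 s, in line with the measured 10.75 s and 21.49 s.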

The fact that the ray-intersection calculation achieved the same speedup as the normal-vector
calculation was surprising. As described in section 5.3 on the preceding page, the implementation
of the ray-intersection calculation only used two QDR SRAMs, though theoretically it could have
used four. The fact that the throughput of the sequential program decreased by about the same
factor as that of the FPGA implementation seems to indicate that the sequential program was also
limited by memory bandwidth. However, the results suggest that the FPGA implementation should
have been able to double its throughput had four memories been used instead of two. If such an
implementation could be developed, a speedup greater than 20× seems possible.


5.3  Power  Measurement 

Power consumption was measured using Cray’s Hardware Supervisory Subsystem (HSS) software,
which runs on the management processor of each Cray XD1 chassis and is designed to monitor the
health of the system [33]. The HSS reports the voltage supplied to the regulators of each node and
the current supplied by the regulators to individual components of the node, including the Opterons
and FPGAs. Because both the Opterons and FPGAs draw power even when idle, monitoring total
power (in watts) was of greatest interest. For this reason power measurements were taken both on
nodes with an FPGA present and on nodes without. In all, five power levels for each of the two
calculations implemented were measured:

1. a node without an FPGA while idle,

2. a node without an FPGA while running the sequential implementation,

3. a node with an FPGA while idle,

4. a node with an FPGA while running the sequential implementation, and

5. a node with an FPGA while running the FPGA implementation.


Table 5.5: Ray-intersection power measurements.

Node Type    Implementation     Total Power (watts)
No FPGA      Idle               102.65
FPGA         Idle               130.94
No FPGA      Sequential Only    110.87
FPGA         Sequential Only    139.57
FPGA         FPGA               142.87

Table 5.6: Normal-vector power measurements.

Node Type    Implementation     Total Power (watts)
No FPGA      Sequential Only    111.84
FPGA         Sequential Only    139.84
FPGA         FPGA               143.66

Three  sets  of  100  samples  each  were  measured  at  2  s  intervals.  The  full  datasets  are  displayed 
in  figures  5.2  on  the  next  page  through  5.9  on  page  45.  By  comparing  the  mean  values  and  standard 
deviations  between  sets,  it  was  found  that  power  was  independent  of  time,  that  is,  stationary  at  the 
scale  of  the  measurements.  Tables  5.5  and  5.6  present  the  mean  values  of  the  findings  across  all 
300  samples  in  each  case.  Idle  power  measurements  are  omitted  from  Table  5.6  because  they  were 
equal  to  the  values  presented  in  Table  5.5. 
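The text does not say how the three sets of readings were compared. One simple approach, sketched here as an assumption, is to compute the mean and sample variance of each set of 100 readings and check that they agree across sets. The function names are hypothetical, and the standard deviation is the square root of the variance returned.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n power readings (watts). */
double mean_power(const double *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum / (double)n;
}

/* Unbiased sample variance of n power readings (watts squared). */
double sample_variance(const double *x, size_t n)
{
    assert(n > 1);
    double m = mean_power(x, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; ++i)
        ss += (x[i] - m) * (x[i] - m);
    return ss / (double)(n - 1);
}
```

Applying both functions to each set separately, and then to all 300 readings together, would give the per-set statistics and the overall means of the kind reported in tables 5.5 and 5.6.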

In the case of the ray-intersection calculation, the Virtex-II Pro implementation required 1.285×
the power of the sequential program running on a node with no FPGA (a 28.5% increase) and
1.027× the power of the sequential program running on a node with an FPGA (a 2.7% increase).
For the normal-vector calculation, the FPGA implementation required 1.259× and 1.011× the
power of the sequential program (increases of 25.9% and 1.1%), respectively.

Figure 5.2: Background power measurements of a node without an FPGA. (Plot titled "No FPGA Node: Idle"; x-axis: sample number at 2 s intervals.)

The background power required for any system will vary based on the particular operating
system or other processes that are running in addition to the calculation of interest. Processing
unit power can be isolated from background power by calculating the ratio

\[
\frac{P_{\text{FPGA}} - P_{\text{FPGA,IDLE}}}{P_{\text{SEQ}} - P_{\text{SEQ,IDLE}}} \tag{5.2}
\]

where $P_{\text{SEQ}}$ is the power consumed by the sequential program in a node with no FPGA
attached, $P_{\text{SEQ,IDLE}}$ is the power consumed in the same node when the sequential program is
not executing (although the operating system’s instructions will still be executing in that node),
$P_{\text{FPGA}}$ is the power consumed by the parallel hardware design in an FPGA in a node with the
FPGA attached, and $P_{\text{FPGA,IDLE}}$ is the power consumed by that same node when the FPGA does
not contain the parallel design and so is idling. The result was 1.240× power consumption (24.0%
increase) for the ray-intersection calculation and 1.451× (45.1% increase) for the normal-vector
calculation.
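The ratio in equation 5.2 is simple to express in C. The sketch below uses a hypothetical function name and takes the four mean power readings in watts as arguments.

```c
#include <assert.h>

/* Ratio of the FPGA's active power to the sequential processor's active power
 * (equation 5.2), with each node's idle power subtracted. All values in watts. */
double isolated_power_ratio(double p_fpga, double p_fpga_idle,
                            double p_seq, double p_seq_idle)
{
    double seq_active = p_seq - p_seq_idle;
    assert(seq_active > 0.0); /* the sequential run must draw more than idle */
    return (p_fpga - p_fpga_idle) / seq_active;
}
```

A result above 1 means the FPGA implementation draws more power above its idle level than the sequential program does above its own. The idle and active readings passed in must come from the matching node types defined above.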
Figure 5.3: Normal-vector calculation implemented with only an Opteron 275 on a node without an FPGA.

Figure 5.4: Ray-intersection calculation implemented with only an Opteron 275 on a node without an FPGA.

Figure 5.5: Background power measurements of a node with an FPGA.

Figure 5.6: Normal-vector calculation implemented with only an Opteron 275 on a node with an FPGA.

Figure 5.7: Normal-vector calculation implemented with a Virtex-II Pro and an Opteron 275.

Figure 5.8: Ray-intersection calculation implemented with only an Opteron 275 on a node with an FPGA. (Plot titled "Node with FPGA: Ray-Intersection Sequential Implementation"; x-axis: sample number at 2 s intervals.)

Figure 5.9: Ray-intersection calculation implemented with a Virtex-II Pro and an Opteron 275. (Plot titled "Node with FPGA: Ray-Intersection FPGA Implementation"; x-axis: sample number at 2 s intervals.)
Chapter 6
Conclusion 


In this paper, the acceleration of two portions of the optical simulation of NASA’s Moderate
Resolution Imaging Spectroradiometer was presented. The Mitrion-C HLL was used to implement
hardware designs on Virtex-II Pro FPGAs. A functionally equivalent program was written using
ANSI-C and implemented on an Advanced Micro Devices Opteron 275 processor.

Throughput and power of all implementations were measured on the Cray XD1 supercomputer.
Recent marketing literature from Mitrionics AB, the developer of Mitrion-C, has claimed that
FPGAs can process up to 100 times faster than sequential processors and use only 2% as much
power when operating at the same speed as sequential processors [30]. The maximum speedup
measured in this project was 10.67×, or a 967% increase. This speedup was measured using only
two of the FPGA’s four memories for input. It is predicted that the measured speedup would have
been doubled had all four memories been used. However, the feasibility of such an implementation
from a resource and power consumption standpoint is unknown.

The  maximum  power  increase  required  to  run  an  FPGA  was  measured  to  be  45.1%  when 
power  consumed  by  the  processing  unit  was  isolated.  However,  many  researchers  may  be  more 
interested  in  total  power  consumption  because  of  overall  heat  and  cost  limits.  Taking  background 
power  into  account,  the  maximum  power  increase  required  to  run  an  FPGA  was  measured  to 
be  28.5%.  Throughput  and  power  are  presented  separately  as  benefit  and  cost  because  different 
applications  may  weight  different  factors  more  heavily,  and  so  no  one  direct  comparison  would  be 
comprehensive. 

The  results  showed  that  floating-point  operations  using  FPGAs  offer  significant  speedup  over 
sequential  processor  implementations  without  excessive  additional  power  consumption.  It  was  also 
shown  that  high-level  languages  such  as  Mitrion-C  can  reduce  development  times  and  the  need  for 
extensive  experience  with  hardware  design  and  still  achieve  efficient  FPGA  use. 

The greatest disadvantage observed in this research to using FPGAs for high-performance
computing was the need for a sequential host program to feed new data to the FPGA. Even when the
sequential program was not contributing directly to the calculations, it continued to consume a
significant amount of power. There are two ways to avoid this problem: (1) eliminate the sequential
processor and find another way to feed data to the FPGA or (2) use the sequential processor both
to feed data to the FPGA and to perform some of the calculations in parallel with the FPGA.